Activate FlashHead under vllm serve #2

Open

WilhelmTr wants to merge 1 commit into master from fix/vllm-serve-activation

Conversation

WilhelmTr commented Apr 22, 2026

Summary

  • Add patch_async_llm, targeting AsyncLLM.__init__, so the FlashHead metadata load runs under both the Python LLM(...) API and vllm serve (see the sketch after this list). The existing patch_llm only covers LLMEngine.from_engine_args, which vllm serve never reaches in vLLM 0.19: the OpenAI entrypoint goes through AsyncLLM.from_vllm_config and then AsyncLLM.__init__.
  • Drop the negative-result cache in logits_processor._get_flash_head so a metadata file that appears after server startup is still picked up on the next decode step.
  • Bump to 0.1.10 to trigger the PyPI release.
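
In rough form, the new patch mirrors patch_llm as a wrapper around AsyncLLM.__init__. A minimal sketch; the import path and the metadata derivation are assumptions, and only set_flash_head and the log line are taken from this PR:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def patch_async_llm():
    """Wrap AsyncLLM.__init__ so metadata is prepared on the vllm serve path."""
    # Import path is an assumption; vLLM has moved AsyncLLM between releases.
    from vllm.v1.engine.async_llm import AsyncLLM

    original_init = AsyncLLM.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Here the real patch derives `metadata` from the engine/model config
        # and calls set_flash_head(metadata), mirroring what patch_llm does
        # on the LLMEngine.from_engine_args path (elided in this sketch).

    AsyncLLM.__init__ = patched_init
    logger.info("[FlashHead] Patched AsyncLLM.__init__")
```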

What went wrong today (repro)

With flash-head==0.1.9 installed against vllm==0.19.1, running

vllm serve embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead \
    --max-model-len 8192 --gpu-memory-utilization 0.75 --max-num-seqs 2

starts up and serves correctly, but /tmp/flashhead_metadata.pt is never written. get_flash_head() returns None, the patched LogitsProcessor._get_logits falls straight through to the original dense path on every decode step, and FlashHead is silently disabled under vllm serve.

Traced through: vllm.entrypoints.openai.api_server.build_async_engine_client_from_engine_args calls AsyncLLM.from_vllm_config, which calls AsyncLLM.__init__. LLMEngine.from_engine_args (the legacy class the current patch targets) is never called. The Python LLM(...) API still works because LLM.__init__ does call LLMEngine.from_engine_args.
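
For reference, the two construction paths side by side (names as traced above, simplified):

```
LLM(...) API (covered by the existing patch_llm):
    LLM.__init__ -> LLMEngine.from_engine_args             # hook fires here

vllm serve (missed until this PR):
    api_server.build_async_engine_client_from_engine_args
        -> AsyncLLM.from_vllm_config -> AsyncLLM.__init__  # new hook here
```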

Verification

Before: no [FlashHead] Loaded lazily... log from either process after startup, no /tmp/flashhead_metadata.pt, dense-head fallback.

After, with this PR:

flash_head.patches.async_llm INFO [FlashHead] Patched AsyncLLM.__init__
flash_head.patches INFO [FlashHead] All patches applied
flash_head INFO [FlashHead] Plugin registered
flash_head.loading INFO [FlashHead] Metadata prepared for lazy loading from flash_head_assets
flash_head.patches.async_llm INFO [FlashHead] Metadata saved for model: embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead
...
flash_head.loading INFO [FlashHead] Loaded lazily on GPU using 'lm_head.weight'

Exact curl from the README returns a coherent detailed video description.
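
(The README's exact command isn't reproduced here; it has the shape of a standard request against vLLM's OpenAI-compatible endpoint, roughly the following with an illustrative prompt:)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead",
        "messages": [{"role": "user", "content": "Describe the video in detail."}]
      }'
```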

Note (not fixed here)

vLLM's DEFAULT_LOGGING_CONFIG only attaches a handler to the vllm logger, so every [FlashHead] ... INFO line is dropped unless the user sets VLLM_LOGGING_CONFIG_PATH to a config that includes a flash_head logger. Worth either adding a handler inside register() (sketched below), or mentioning in the README that the activation banner won't appear under vllm serve by default.
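
A minimal sketch of the register()-side option, assuming register() is the plugin entry point and that the package logger is named flash_head (follow-up material, not part of this PR):

```python
import logging
import sys


def register():
    """Plugin entry point; also attach a handler so [FlashHead] logs surface."""
    fh_logger = logging.getLogger("flash_head")
    if not fh_logger.handlers:  # avoid duplicate handlers if called twice
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
        fh_logger.addHandler(handler)
        fh_logger.setLevel(logging.INFO)
    # ... existing registration logic (patches, plugin hooks) ...
```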

The commit message:

vLLM 0.19 reaches `AsyncLLM.__init__` through `AsyncLLM.from_vllm_config`
for the OpenAI server, skipping `LLMEngine.from_engine_args`. That left
`set_flash_head(metadata)` uncalled under `vllm serve`, so the patched
`_get_logits` always saw `get_flash_head() is None` and silently fell
back to the dense lm_head on every decode step.

Add a mirror of patch_llm that targets `AsyncLLM.__init__` so the
metadata is written on both paths, and stop caching the None result in
`_get_flash_head` so a late-arriving metadata file is picked up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WilhelmTr requested a review from JonnaMat on Apr 22, 2026 at 14:17
logger = logging.getLogger(__name__)

# Sentinel for lazy loading
_FLASH_HEAD_NOT_LOADED = object()
Member commented:
This is needed since get_flash_head() may be None (e.g., when running non-FlashHead models).
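
For context, a sketch of how the sentinel and the no-negative-cache behavior fit together; the accessor body is an assumption about the module, not the literal diff:

```python
_flash_head = _FLASH_HEAD_NOT_LOADED


def _get_flash_head():
    """Resolve the FlashHead state, retrying while it is still unset."""
    global _flash_head
    if _flash_head is _FLASH_HEAD_NOT_LOADED:
        loaded = get_flash_head()  # may be None, e.g. non-FlashHead models
        if loaded is None:
            # Don't cache the negative result: a metadata file written after
            # server startup is then picked up on a later decode step.
            return None
        _flash_head = loaded
    return _flash_head
```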

return None


def patch_async_llm():
Member commented:

I think we should add a guard for idempotence, similar to what we do in logits_processor.py (if _flash_head is None: ...).

While AsyncLLM.__init__ runs only once per engine construction (not per decode/request), other parts of vLLM may call it. We could add a _FLASH_HEAD_NOT_LOADED-style sentinel, as sketched below.
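
A sketch of that guard with a module-level flag (the flag name is illustrative, not from the PR):

```python
_ASYNC_LLM_PATCHED = False  # module-level idempotence guard


def patch_async_llm():
    global _ASYNC_LLM_PATCHED
    if _ASYNC_LLM_PATCHED:
        # AsyncLLM.__init__ is already wrapped; don't wrap it twice if some
        # other code path triggers the patch again.
        return
    _ASYNC_LLM_PATCHED = True
    # ... wrap AsyncLLM.__init__ as in the patch above ...
```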
